Exploiting tree-based variable importances to selectively identify relevant variables
Abstract
This paper proposes a novel statistical procedure based on permutation tests for extracting a subset of truly relevant variables from multivariate importance rankings derived from tree-based supervised learning methods. It also shows that the classical permutation-test approach for estimating false discovery rates of univariate variable scoring procedures does not carry over well to multivariate tree-based importance measures.
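For orientation, the classical univariate scheme mentioned in the abstract can be sketched roughly as follows. This is not the procedure proposed in the paper, whose details are not given here. It is a generic permutation-based false discovery rate estimate, and the scoring function (absolute Pearson correlation), the toy data, and the names `abs_correlation`, `permutation_fdr`, and `make_toy_data` are all assumptions made for illustration.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_fdr(columns, y, n_perm=200, seed=0):
    """Estimate the FDR of the top-k subset for every rank k.

    columns: one list of values per variable; y: list of outputs.
    Returns (ranking, fdr): variable indices by decreasing score, and
    fdr[k-1] as the estimate for selecting the k best-ranked variables.
    """
    rng = random.Random(seed)
    scores = [abs_correlation(col, y) for col in columns]
    ranking = sorted(range(len(columns)), key=lambda j: -scores[j])
    # exceed[k]: total count, over all permutations, of null scores that
    # reach the observed score of the variable at rank k.
    exceed = [0] * len(columns)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any input-output association
        null_scores = [abs_correlation(col, y_perm) for col in columns]
        for k, j in enumerate(ranking):
            exceed[k] += sum(s >= scores[j] for s in null_scores)
    # Expected number of false positives at each threshold, divided by
    # the number of variables selected, capped at 1.
    fdr = [min(1.0, exceed[k] / n_perm / (k + 1)) for k in range(len(columns))]
    return ranking, fdr


def make_toy_data(n=100, p=10, seed=1):
    """Hypothetical data: variables 0 and 1 are relevant, the rest are noise."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [columns[0][i] + columns[1][i] + 0.5 * rng.gauss(0, 1) for i in range(n)]
    return columns, y


columns, y = make_toy_data()
ranking, fdr = permutation_fdr(columns, y)
for k, j in enumerate(ranking, start=1):
    print(f"rank {k}: variable {j}, estimated FDR {fdr[k - 1]:.3f}")
```

Because each variable is scored on its own, shuffling the outputs yields a valid null distribution for every score. With multivariate tree-based importances the score of one variable depends on the others, which is the setting where the abstract reports that this direct extension works poorly.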
Similar resources
Understanding variable importances in forests of randomized trees
Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a thre...
Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze S...
Context-dependent feature analysis with random forests
Feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances f...
The effect of age on the growth variables of beech trees in the forests of the Lomir basin, Guilan Province
Oriental beech forests are of economic and ecological importance in the Hyrcanian zone of northern Iran. Qualitative and quantitative monitoring of the stands is therefore essential to the management of these forests. This study aimed to determine the effect of age on the growth variables of beech trees in the Lomir forest in Asalem, Guilan Province. In this study, 179 Beech trees were selected bas...
Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies
The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and levera...
Publication date: 2008